Transliteration using phrase based SMT approach on substrings
Abstract
Translation of named entities (NEs), such as person, organization and location names, is crucial for cross-lingual information retrieval, machine translation, and many other natural language processing applications. New named entities appear in newswire on a daily basis, which greatly complicates the translation task. Translating named entities between languages with different orthographic bases is more complex than translating between similar languages, because such languages may map consonants and vowels differently. For example, translating English names to Arabic raises several problems caused by lexical differences. First, Arabic usually leaves short vowels unwritten, in contrast to English, where short vowels are usually written. The Arabic short vowels (Fathah, Kasrah and Dammah) are nevertheless pronounced and should appear in the target language. Second, some Arabic consonants may be mapped to several English consonants, for example (س → s, c), (ب → b, p), (ك → k, c, ck), while others are mapped to more than one consonant, for example (ش → sh, ch), (ث, ذ → th), which makes the problem a many-to-many mapping task. Finally, a general problem in named entity transliteration is that it is always preferable to produce the most commonly used form of the name. In this paper we introduce a phrase-based Arabic-to-English transliteration system that aligns Arabic substrings to English substrings using a parallel corpus of aligned named entities. A cascaded spelling suggestion module is proposed to solve the problems that lie beyond the limits of phrase-based transliteration. The spelling suggestion is applied to the output of the phrase-based transliteration system to produce the best spelling correction for the transliterated name. This step biases our system towards the commonly used form of the name rather than the pure morphological representation.
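The two-stage cascade described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the phrase table, its probabilities, the lexicon, and the function names `transliterate` and `suggest_spelling` are all invented for illustration. A real system would learn the phrase table from the aligned parallel corpus and decode with a full SMT decoder that also uses a language model; here the decoder is a simple monotone search over substring segmentations, and the spelling suggestion is approximated by string similarity against a list of common name forms.

```python
from difflib import get_close_matches

# Hypothetical toy phrase table: Arabic substring -> [(English substring, probability)].
# In a real system these entries would be learned from aligned name pairs.
PHRASE_TABLE = {
    "م": [("m", 0.6), ("mo", 0.4)],
    "ح": [("h", 0.5), ("ha", 0.5)],
    "د": [("d", 1.0)],
    "مح": [("muh", 0.7), ("moh", 0.3)],
    "مد": [("mad", 0.8), ("med", 0.2)],
}

# Hypothetical lexicon of commonly used English name forms.
LEXICON = ["muhammad", "mohamed", "ahmad"]


def transliterate(source, table, max_len=3, n_best=5):
    """Monotone decoding: cover the source left to right with table substrings.

    best[i] keeps the n_best (score, english) hypotheses covering source[:i];
    a hypothesis score is the product of the chosen phrase probabilities.
    """
    best = [[] for _ in range(len(source) + 1)]
    best[0] = [(1.0, "")]
    for end in range(1, len(source) + 1):
        hyps = {}
        for start in range(max(0, end - max_len), end):
            chunk = source[start:end]
            for target, prob in table.get(chunk, []):
                for score, prefix in best[start]:
                    cand = prefix + target
                    new_score = score * prob
                    if new_score > hyps.get(cand, 0.0):
                        hyps[cand] = new_score
        ranked = sorted(((s, c) for c, s in hyps.items()), reverse=True)
        best[end] = ranked[:n_best]
    return [(cand, score) for score, cand in best[-1]]


def suggest_spelling(candidate, lexicon, cutoff=0.8):
    """Return the closest common form, or the candidate itself if none is close."""
    matches = get_close_matches(candidate, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


candidates = transliterate("محمد", PHRASE_TABLE)
raw_best = candidates[0][0]
print(raw_best, "->", suggest_spelling(raw_best, LEXICON))
```

The first stage alone tends to drop unwritten short vowels, because the Arabic input does not contain them. The second stage is what pulls the output towards a commonly used spelling, which mirrors the motivation given for the cascaded module.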
Similar resources
Language Independent Transliteration System Using Phrase-based SMT Approach on Substrings
Every day the newswire introduces events from all over the world, highlighting new names of persons, locations and organizations with different origins. These names appear as Out of Vocabulary (OOV) words for machine translation, cross-lingual information retrieval, and many other NLP applications. One way to deal with OOV words is to transliterate the unknown words, that is, to render them in th...
Transliteration of Name Entity via Improved Statistical Translation on Character Sequences
Transliteration of given parallel name entities can be formulated as a phrase-based statistical machine translation (SMT) process, via its routine procedure comprising training, optimization and decoding. In this paper, we present our approach to transliterating name entities using the loglinear phrase-based SMT on character sequences. Our proposed work improves the translation by using bidirec...
A Bayesian model of bilingual segmentation for transliteration
In this paper we propose a novel Bayesian model for unsupervised bilingual character sequence segmentation of corpora for transliteration. The system is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling implemented using an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the o...
Experiences with English-Hindi, English-Tamil and English-Kannada Transliteration Tasks at NEWS 2009
We use a Phrase-Based Statistical Machine Translation approach to Transliteration where the words are replaced by characters and sentences by words. We employ the standard SMT tools like GIZA++ for learning alignments and Moses for learning the phrase tables and decoding. Besides tuning the standard SMT parameters, we focus on tuning the Character Sequence Model (CSM) related parameters like or...
English-Korean Named Entity Transliteration Using Statistical Substring-based and Rule-based Approaches
This paper describes our approach to English-Korean transliteration in the NEWS 2011 Shared Task on Machine Transliteration. We adopt the substring-based transliteration approach, which groups the characters of a named entity in both source and target languages into substrings and then formulates the transliteration as a sequential tagging problem to tag the substrings in the source language with the su...